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Abstract — Observations where additive noise is present can for 
many models be grouped into a compound observation matrix, 
adiiering to the same type of model. There are many ways the 
observations can be stacked, for instance vertically, horizontally, 
or quadratically. An estimator for the spectrum of the underlying 
model can be formulated for each stacking scenario in the case 
of Gaussian noise. We compare these spectrum estimators for the 
different stacking scenarios, and show that all kinds of stacking 
actually decreases the variance when compared to just taking 
an average of the observations. We show that, regardless of 
the number of observations, the variance of the estimator is 
smallest when the compound observation matrix is made as 
square as possible. When the number of observations grow, 
however, it is shown that the difference between the estimators is 
marginal: Two stacking scenarios where the number of columns 
and rows grow to infinity are shown to have the same variance 
asymptotically, even if the asymptotic matrix aspect ratios differ. 
Only the cases of vertical and horizontal stackings display 
different behaviour, giving a higher variance asymptotically. 
Models where not all kinds of stackings are possible are also 
discussed. 

Index Terms — Gaussian matrices. Random Matrices, free con- 
volution, deconvolution, spectrum estimation. 



I. Introduction 

Random mattices find applications in many fields of re- 
search, such as digital communication mathematical fi- 
nance II2I and nuclear physics 13]. Free probability theory |4|, 
lH], [l6l, fl], fSl has strong connections with random matrix 
theory, and can be used for high dimensional statistical infer- 
ence by addressing the following questions: 

Given A, B two n x n independent square Hermitian (or 
symmetric) random matrices: 

1 ) Can one derive the eigenvalue distribution of A from the 
ones 0/ A + B and B ? 

2) Can one derive the eigenvalue distribution of A. from the 
ones of AB and B ? 

More generally, such questions can be asked starting with 
any functional of the involved random matrices. If 1) or 2) 
can be answered for given random matrices A and B, the 
corresponding operation for finding the eigenvalue distribu- 
tion is called deconvolution. Deconvolution can be easier to 
perform in the large n-limit, and the literature contains result 
in this respect both for Vandermonde matrices ||9l, ifTOl . and 
Gaussian matrices ifTTl . For Gaussian matrices, there also exist 
results in the finite regime [12|. The methods generally used to 
perform deconvolution are the Stieltjes transform method Iil3l . 
and the moments method ID, lfT4l . In this contribution we will 
focus on the latter, which is based on the relations between 
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the moments of the matrices involved. The p-th moment of an 
n X n random matrix A is defined as 



;[tr(Af)] - J \Pdp{X) (1) 

where E[-] is the expectation, tr the normalized trace, and 
dp = E (i ^"^j^ i5(A — Ai)) the associated empirical mean 
measure, with Xi the eigenvalues of A. Both the Stieltjes 
transform and the moments can be used to retrieve eigenvalues, 
and can therefore be used for spectrum estimation. For many 
types of random matrices, converges almost surely when 
71 —> 00 to an analytical expression t^, depending only 
on some specific parameters, such as the distribution of the 
entries of A. This enables to reduce the dimensionality of 
the problem, and simplifies the computation of convolution of 
measures. 

Expressions for deconvolution turn out to be quite simple 
if asymptotic freeness (E] is assumed, which is the case for 
large Gaussian matrices. However, freeness does not hold in 
the finite case. In this respect, lfT2l modifies the moment-based 
free probability framework so that it applies for Gaussian 
matrices in the finite regime. The goal of this contribution is to 
address a particular question on how this framework best can 
be adapted for spectrum estimation purposes. The observations 
in some random matrix models allow for stacking into a larger 
compound observation matrix. When this is possible, some 
questions arise which do not seem to have been covered in 
the literature: 

1) Can the compound observation matrix be put into the 
same finite dimensional inference framework? 

2) Is one stacking of the observations better than another, 
for the purpose of spectrum estimation? 

A popular way of combining observations is the so-called 
sample covariance matrix, which essentially results from 
stacking observations of a random vector horizontally into a 
compound matrix. Results in this paper will actually challenge 
this construction for certain random matrix models, showing 
that it is not always the best way of combining observations. 
Another facet of stacking is that it can make asymptotic 
results more applicable, and eliminate the need for results in 
the finite-dimensional regime. This can be very nice, since 
asymptotic results can be simpler to obtain, and have a nicer 
form. However, we will also present a model where stacking 
can only be partially applied. The framework [12| has also 
been applied in a situation where it is not clear how to apply 
stacking in any way 1 15|. We will give a partial answer to the 
above questions in this paper, in the sense that we characterize 
the stacking of observations which is optimal in terms of the 
variance of the corresponding spectrum estimators, and we 
characterize what we gain in comparison with methods where 
observations are not stacked. 
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The paper is organized as follows. Section provides 
background essentials on random matrix theory needed to 
state the main results. In particular, we define different ways 
of stacking observations, and random matrix models we will 
analyze which allow for such stacking. Section states the 
main result, which concerns optimal stackings with the frame- 
work IfTSl . The main result can be stated without the actual 
definition of the spectrum estimator in question, which is 
therefore delayed till Section|IV] There we also prove formally 
that this estimator is unbiased, and state an expression for it, 
useful for implementing it and for proving the main result. In 
section |V] we present some useful simulations verifying the 
results. 

II. Random matrix Background Essentials 

In the following, upper boldface symbols will be used 
for matrices, and lower symbols will represent scalar values. 
(.)^ will denote the transpose operator, (.)* conjugation, and 
{.)^ — ((•)^) hermitian transpose. I„ will represent the nxn 
identity matrix. We let Tr be the (non-normahzed) trace for 
square matrices, defined by 



Tr(A)=^a. 



where an are the diagonal elements of the nxn matrix A. 
We also let tr be the normalized trace, defined by tr(A) — 
iTr(A). When A is non-random, we define its moments by 
Ap = tr(A?'), and more generally 



tr(APi)---tr {AP''). 



If A is instead random, its (expected) moments are given by 
([U, and more generally 

E[tr(APi)---tr(A'"=)]. 

X will denote a standard complex Gaussian matrix, meaning 
that it has i.i.d. complex Gaussian entries with zero mean and 
unit variance In particular, the real and imaginary parts of the 
entries are independent, each with variance i. 

From L = L1L2 observations of an ?i x random matrix 
Y, we can form the {nLi) x {NL2) compound observation 
matrix, denoted y^Li^L^^ by stacking the observations into a 
Li X L2 block matrix in a given order. Similarly, if D is 
non-random, we will denote by D^^ ^2 the compound matrix 
formed in the same way from D. We will be concerned with 
the following question: 

Given a random matrix model on the form Y = 
/(D, Xi, X2, ...), where D is non-random, and the Xj are 
Gaussian and independent. How can we best infer on the 
spectrum of D from independent observations Yi, Y^, of 
Y? 

We will restrict such inference to models where our methods 
apply directly to the compound observation matrix. This turns 
out to be easier when one of the Xj is an additive component 
in /. To examplify this, we will first state the two models we 
will analyze. For both, spectrum estimation methods from lfT2l 
will be applied. 



A. The additive model 

In applications such as MIMO channel modeling, the addi- 
tive model 

Y = /(D,X) = D + X, (2) 

applies, where D is non-random and X is Gaussian, both n x 
N . Given L = L1L2 observations of this model, the compound 
mati-ices Yl^Xs - Dl^^l^ , Xli,l2 satisfy 

YLi,L2 = /(DLi,L2,Xlj^L2). 

For (|2|i, we have moment-based methods to infer on the 
spectrum of ^DD^ from that of ^YY^ [12l, fl^. Since 
^Li,L2 ^Iso is Gaussian, the same methods can be used 
to infer on the spectrum of -jy^DLi,L2-DFi L2 from the 
compound observation matrix. But since 



tr 



NL2 



-D 



H 



N 



(3) 

spectrum estimation methods applied to the compound ob- 
servation matrix actually helps us to infer on the spectrum 
of -^DD^. We will state this estimator later on. To ease 
notation, we will let Dp = tr ((-^DD^)^) in the following. 
We will see that different stackings £1 , L2 give rise to different 
spectrum estimators, and compare their variances. 

B. A more involved model 
The random matrix model 



Y = /(D,Xi,X2) = DXi+X2 



(4) 



can be found in multi-user MIMO applications. D is non- 
random (n X m), and Xi (m x N) and X2 {n x N) are 
independent and Gaussian. This model can also be subject to 
stacking, although the first component DXi in the sum now 
is random. To see this, assume that we have L independent 
observations Yk = DXi,fc + X2,fc, 1 <k < L. Writing 



Yi...i = [Yi,Y2,...,Yi] 

= [Xi^i, Xi^2, 



(5) 



(where Yi...l is n x {NL), Xi^i...^ is m x (NL), X2,i...l is 
n X (NL)), we can write Yi...l = DXi,i...l + X2,i...l- This 
has the same form as the original model, so that the same 
type of spectrum estimation methods can also be used for 
the compound observation matrix. The underlying spectrum 
estimation method is now a two-stage process, where we in 
the first stage infer on the expected moments 



tr 



D 



H 



(6) 



and use these in a second stage to infer on the moments of 
DD^. This will be demonstrated further in Section [Vl 

In light of the two models we have mentioned, we will differ 
between the following: 

Definition 1: We will call a stacking of L = L1L2 obser- 
vations into a Li X L2 block matrix 

• a horizontal stacking if Li — 1, 
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• a vertical stacking if L2 ~ 1, or 

• a rectangular stacking if the limit c = lim exists as the 
number of observations grow to infinity, with < c < cxd. 

These three types of stackings are also denoted by H, V, and 
R, respectively. 

For dU, a horizontal stacking of observations lended itself, 
and seems to be the only natural way of stacking. This is in 
contrast to (|2]i where any rectangular stacking applied. For 
other models, neither rectangular, nor horizontal stacking way 
work [ 1 5 1 . We will not attempt to classify for which models the 
different types of stackings are possible, but rather to compare 
the different types of stackings whenever they apply. 

In the literature, horizontal stacking has been applied to 
dU ifTSj. The framework in |12| can be used to define 
unbiased estimators, useful for inference both for (|2]l and 
dUi. For (|2]l, we will denote by Dp such an estimator for 
Dp. Applying the framework to the compound observation 
matrices as sketched above, we get other unbiased estimators 
for Dp, denoted Dp,L^X2- The main result, stated in the next 
section, is concerned with finding the "optimal" stacking^in 
the sense of finding which Li,L2 give an estimator Dp^^^^L^ 
with lowest possible variance. We will let fp,-,L denote the 
variance of Dp^L^^L^, with L — L1L2 the number of ob- 
servations, and • the stacking (H, V, or R). We will in 
addition let A denote taking the average of L applications of 
Dp, and denote the variance of the corresponding estimator 
by Vp,A,L- All stackings will be compared with averaging 
also. The main result is stated before the formulations of the 
estimators themselves (see SectionlTvTi. since these expressions 
are rather combinatorial in nature, and thus need some more 
preliminaries before they can be stated. 



III. Statement of the main result 

The following result essentially says that any rectangular 
stacking is asymptotically the best one when our framework is 
used, and that horizontal and vertical stacking have a slightly 
higher variance. Averaging of observations gives a variance 
higher than this again. An even stronger conclusion can be 
drawn in the case of any given finite number of observations, 
where we prove that the stacking with "the most square" 
compound observation matrix gives the lowest variance. This 
challenges the classical way of stacking observations horizon- 
tally when forming the sample covariance matrix. 

When P is a polynomial in several variables, denote by the 
degree of P, or deg(P), the highest sum of exponents in any 
term therein. 

Theorem 1: All estimators Dp^Dp^^^x^ we consider are 
unbiased estimators for Dp = tr ((-i-DD^)'^) from obser- 
vations of (|2]i, and their variances Wp,.,L are all 0{L^^). 
Moreover, 



where • can be H, V, R, or A. For p > 2 we have that 



lim LvpRL 



nN' 



D2P-1 



lim Lvpy^L = —rfD2p-i + -r^D2p-2 

2p2 p2 

lim Lvp^H.L = -^D2p-i H —D2P-2 

L-i-oo ■ nN nN 



lim LvpAL 



2p2 

nN' 



D2P-1 



'Qp2p-3,...,^l), 



lim Lvi_.x 



2 1 

Di H , 

nN nN' 



where Q is a polynomial in D2P-3, -D2P-4, ■ ■ ■ ,Di of degree 
2p — 2, with only positive coefficients. In particular, all 
rectangular stackings asymptotically have the same variance, 
and 

lim Lvp_Rx < lim Lvpy,L < lim Lvp^A,L 
lim Lvp^ii^L < lim Lvp^n.L < lim Lvp^A,L 

(since Q{D2p^3, Di) > 0, since all Dp > 0). Also, the 
variance decreases with L for a fixed stacking aspect ratio, 
and, for a given L and any rectangular stackings Ri , R2 
into L = ij^' X 2^2^^ and L = l''^^ x Lj^^ observations, 
respectively. Vp^B.i.L < Vp,F/.2.L if and only if the {nL\^^) x 

{NL^'^) compound observation matrix is more square than 

(2) (2) 

the {nL\ ') x {NL2 ) compound observation matrix. Also, 
Wp,.,L < Vp^AX for any stacking. 

The proof of Theorem [T| can be found in Appendix IbI The 
polynomial Q above can be computed for the lower order 
moments, to compare the actual difference between averaging 
and stacking. Although we do not state the expression for 
Q, we have computed its values in an implementation in 
Section |V] in order to show the actual variances for the 
different stacking scenarios. 

Since Theorem [T] is a statement on the leading order term 
of the variances of moments of certain random matrices, it is 
in the same genre as the recently developed theory of second 
order freeness iflTl . ifTsl . lfT9l . Our matrix setting is, however, 
slightly different than the ones considered in these papers. 

We will not state expressions for the variance for the model 
(|4|i, since this is more involved. Instead, we will in Section |V] 
verify in a simulation that stacking seems to be desirable here 
as well. In the next section we will formulate the unbiased 
estimators for our models, which the main result refers to. 

IV. Formulation of the estimator 

To state our estimators, we need the following concept, 
taken from IIT2I : 

Definition 2: Let p be a positive integer. By a partial 
permutation we mean a one-to-one mapping tt between two 
subsets pi, p2 (which may be empty) of {1, ... We denote 
by \pi\ the number of elements in pi, and by SPp the set of 
partial permutations of p elements. 

TT is uniquely defined from the sets pi,p2, and a one-to 
one mapping g : pi — )> P2- We will therefore in the following 
denote a partial permutation by tt = Tr{pi, p2, q). We will 
need the following result, taken from IIT2I . where the general 
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Statement is for the case when D is random, independent from 
X: 

Proposition 1: Let X be an ?i x iV standard, complex, 
Gaussian matrix and D be an n x non-random matrix. Set 



Dr, 



tr 



1 

'n' 



-DD 



H 



tr 



-DD 



X tr 



N 



-DD 



H 



■ ,Ph 



tr 



N 



(D 



X tr 



X tr 



-(D 



-X)(D4 
+ X)(D 



1 



H 



pi 



X)(D + X)^ 



P2 



Pk 



We have that 



Mr, 



E 

xesPp 



,|<T(x)|-fe 



^ ^fe(p(7r))-fed(p(7r))^KpW)-''i(pW) 

xDiu-,ir, (7) 

where CT(7r), /9(7r), k{p{TT)), kd{p{Ti)), 1{p{tt)), ld{p{Ti)), are 
explained below, and where li, . . . ,lr are the cardinalities of 
the blocks of cr(7r) divided by 2. 

The following geometric interpretations explain the con- 
cepts in Proposition [T] and is a summary from |[T2l : 

• We draw k disconnected circles with 2pi,2p2, ■■■,2pk 
edges, respectively, and number the edges clockwise from 
1 to 2pi + - ■ ■+2pk- The set pi is visualized as a subset of 
the even edges (2, 4, 2p) under the mapping i — > 2i, p2 
is visualized as a subset of the odd edges (1, 3, 2p—l) 
under the mapping z — s> 2i — 1. 

• li'i') = j means that the corresponding even and odd 
edges 2i and 2j — 1 are identified, and with opposite 
orientation. 

• The vertices on the circles are also labeled clockwise, 
so that edge i borders to vertices i and i + 1. When 
edges are identified as above, we also get an identification 
between the vertices bordering to the edges. This gives 
rise to an equivalence relation on the vertices. p{tt) e 
CP(2pi + • ■ • + 2pk) is the corresponding partition of the 
equivalence classes of vertices, where 'J'{n) denotes the 
partition of n elements. 

• It turns out that a block of p either consists of odd 
numbers only (odd vertices), or of even numbers only 
(even vertices). k{p{n)) is defined as the number of 
blocks consisting of even numbers only, l{p{n)) as the 
number of blocks consisting of odd numbers only. 

• Edges from pi and p2 are called random edges, other 
edges are called deterministic edges. kd{p{TT)) is the 
number of even equivalence classes of vertices bordering 
to a deterministic edge, ld{p{TT)) is defined similarly for 
odd equivalence classes of vertices. 

» a — <t{tt) is the partition where the blocks are the 
connected components of deterministic edges after iden- 
tification of edges. 



• By the graph of random edges we will mean the graph 
constructed when we, after the identification of edges, 
join vertices which are connected with a path of determin- 
istic edges, and afterwards remove the set of deterministic 
edges. 

The quantities fc(p(7r)) — kd{p{TT)) and 1{p{-k)) — ld{p{Tr)) 
in ^ thus describe the number of even and odd vertices, 
respectively, which do not border to deterministic edges in 
the graph after the identification of edges. Note that when 
Pi = /92 = {1, ■■■,p}, ""(tt) is a partition of zero elements. In 
this case we define = 1. 

We now have all terminology in place in order to state a 
useful expression for our estimators: 

Lemma 1: Let Y = D + X be an observation of the model 
(lU, and let Yp be the moments 



Yn 



tr 



1 



-YY 



H 



(8) 



D. 



Pl,---,Pk 



Ipil: 



AtIpiI 



X ^fe(pW)-fe<i(pW)^HpW)-'d(pW) 



/l,...,ir 



(9) 



is an unbiased estimator for Dp^^,,,^p,_, i.e. E (^Dp^ p^ 

Dp^ for all p. In particular. Dp is an unbiased estimator 

for Dp. 

Similarly, given L = L1L2 observations of form the 
compound observation matrix Yli.l2 let instead Yp be 
the moments 



Yp = tr 



1 



NL2 



Li,L2 I Li,Ls 



(10) 



D, 



pi,...,Pk,Li,L2 



L 



fc-pi Pk 



E 



(_i)|pi|(!i^ 



cr(Tr)|-fc 



{NL2 



xlli,....;, (11) 



is also an unbiased estimator for D 



pi,---,Pfc 



for any Li, L2. In 
particular Dp^Li,L2 is an unbiased estimator for Dp. 

When we talk about averaging of observations Yi, Y^,, 
(i.e. the A in Vp,A.L), we mean computing ^pC^i) 
using (|9]l. It is clear that this is also an unbiased estimator 
for Dp, with variance Vp^A.L being times that of Dp, since 
observations are assumed independent. 



Note that there is a constant term in D 



pi,---,Pfc- 



coming 



from TT where pi = p2 = {1, The proof of Lemma [T] 

can be found in Appendix |A1 and builds on Proposition [T] 
The appendix concentrates on the proof of Q, since the 
proof of (fTTT i is immediate: the term trailing ij^^^ ^'^ 
in ( fTTT i is an unbiased estimator for the moments Fp ^ 
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tr 

entire 



tire right hand side of ( fTlT i 



once (|9|l is proved, so that the 
is an unbiased estimator for 



L 



Pk p 

jk-pi pfc p ...p 

^Pl ^Pk 



{Ll-^^F,,)-.-{Ll- 



'Ppk) 



where we have used (|3]l. 

There can also be a known noise variance a present, so 
that (|2]i takes the form Y = D + crX. (|9j can in this case be 
modified to 

|CT(7r)|-fc 



D 



pi, 



,Pk 



- E (- 

7v = Tv{pi ,P2 ,9) 



_l)|pilcr2|pi|^ 



X ^fe(pW)-fcd(pW)^/(p(^))-'d(pW) 

xl1i,...,i,. (12) 

The proof of this is omitted, since it follows the same lines. 

For the additive model dU, we will only be interested in the 
expressions for the estimators Dp. If we have a model where 
D is instead a randornjnatrix R, like (|4]i, one can formulate 
unbiased estimators Rp-^ „^ in the same way following |[T2l . 

unbiased now meaning E ( Rp-^ J — i?pi,...,pj., where 



Rpi,...,Pk — ^ 



tr 



xtr 



IrR^V I tr 



1 \ Pfc 

-RR^ 

N 



1 \ P2 

-RR^ 

N 



(13) 



The estimators (|9]l were also used in lfT6l . without mention- 
ing the form (|9]l. This form is useful in that it makes it clear 
that the expressions (|7]l and (|9]l are quite similar, enabling 
reuse of the implementation developed in lfT2l for computing 
Secondly, ^ can be used for obtaining an expression 
for the variances of Dp, staying within the same framework 
of partitions. We will state this expression and prove it in 
Appendix |A] In Appendix |B] Theorem [T] will be proved by 
analyzing this expression for the different stackings. 

V. Simulations 

For f[7\ an implementation of the concepts used in Proposi- 
tion[T]was made. In the following simulations, the computation 
of ( [TT] ) and ( fTol i has used this implementation, with the 
restriction to the particular class of partitions therein Q. 

Figure [T] shows results for the third moment estimator ( fTTl i 
applied to a diagonal matrix D, with diagonal entries assumed 
to be 2,1,1,0.5 (i.e. n = N — A). The estimator where 
applied to quadratic stackings of L ~ 1,4,9,16,..., all the 
way up to L = 900 observations. Although Theorem [T| says 
that the quadratic stacking is optimal, the difference between 
the different estimators may be hard to detect in practice, 
since differences may be small. Figure |2] gives a comparison 
for the actual variances for different number of observations 
and different stacking aspect ratios, verifying Theorem [T] The 

'a guide to the Matlab source code ranning the following simulations can 
be found in 1201 . 



Fig. 1. The estimator UOt with quadratic stacking applied with different 
number of observations. D is a 4 X 4 matrix. The actual third moment of 
-^DD^ is also shown. 



theoretical limits for rectangular and horizontal stacking and 
averaging are also shown. We have used the same 4x4 matrix, 
and computed the expression ( fTTj i to obtain the variance for the 
estimator for the third moment. As predicted by Theorem [T] 
the variance tends towards the theoretical lower bounds for 
rectangular and horizontal stacking when the number of ob- 
servations grow. For L = 50 observations, to verify the results, 
we have also plotted the empirical variances 



1 



K 



K 

1=1 



where {xi}^^i are K outputs from the estimator (i.e. a number 
of KL observations is needed, since each run of the estimator 
requires L observations), and x = ^iLi is the mean. We 
have set K ~ 1000, and indicated the empirical variances for 
Li = 1, 2, 5, 10, which correspond to c = 0.02, 2/25, 0.5, 2. 

A. The model (H 

We will compare horizontal stacking for ^ with that 
of averaging. As previously mentioned, this is a two-stage 
estimation, where we in the first stage get an unbiased esti- 
mate of the expected moments (|6]l in the case of horizontal 
stacking, and an unbiased estimate of the expected moments 
E ((D (;^XiXf ) D^)^) in the case of averaging. In any 
case, denote the involved matrix by S, define 



Sr) 



(14) 



and denote by Sp-^ p^ the corresponding unbiased estimator. 
In the second stage. Theorem 3 of 1 12| gives unbiased estima- 
tors Dp^^,,,^p^ for the moments of DD^ from the 5'pj,...,p^,, 
by stating an invertible matrix A so that 

[-C'pi,...,pfc] — A [S'pi,...,pfc], 

where [S'pi,...,pfc] are all expected moments (for all possible 
Pi,...,Pk), grouped into a column vector in a given order 
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(a) L = 5 



Horizontal stacking 
Averaging 
- Actual moment 



+ + 



20 30 40 50 60 70 

L 




(b) L = 50. Empirical variances are also shown for c = 0.02, 2/25, 0.5, 2. 

Fig. 2. Figures displaying Lv^ . j^ for the different estimators for the model 
^2), for different number of observations L. A diagonal matrix D with entries 
2, 1, 1, 0.5 on the diagonal has been chosen. The thi'ee rectangular lines are 
the theoretical limits limi^^^ Lv^ . for rectangular stacking, horizontal 
stacking, and averaging, as predicted by Theorem [T] in increasing order. It is 
seen that aspect ratio near 1 gives lowest variance, and that the variances 
decreases towards the theoretical limit predicted by Theorem [T] when L 
increases. 



Fig. 3. The unbiased estimator for j4) for the third moment of DD^, with 
D the 4x4 diagonal matrix with 2, 1, 1, 0.5 on the diagonal. The estimator 
is applied for up to 100 observations, for both cases of horizontal stacking 
and averaging of observations. 



Empirical variance horizontal stacking 
Empirical variance averaging 



Fig. 4. The empirical variances of the estimators for j4), which were shown in 
Figure[3] For each L the estimator was run 50 times on a set of L observations, 
and the empirical variance was computed from this. It seems that the empirical 
variance is lower for the case of horizontal stacking, suggesting that results 
on stacking valid for j2) may have validity for more general models also. 



In Figure |3] the unbiased estimators for horizontal stacking 
and averaging of observations have been compared for (|4|i. 
The simulation is run for the same 4x4 matrix, and it is 
seen that there is a high variance in the estimator for such a 
small matrix, even when the number of observations grows to 
L = 100. To get an idea on whether horizontal stacking gives 
something here also in terms of variance, we need to run the 
estimators many times, and compare their empirical variances. 
This has been done in Figure |4] where the empirical variance is 
computed from 50 runs of the estimator. The figure suggests 
that, indeed, the empirical variance is lower in the case of 
stacking. We will, however, not prove this mathematically. 



VI. Conclusion and further work 

We have analyzed an unbiased spectrum estimator for a 
model with additive Gaussian noise, and shown that the way 
the observations are stacked can play a role. More specifically, 
it is desirable to make the compound observation matrix as 
square as possible, as this will give rise to estimators with low- 
est possible variance. Asymptotically (i.e. when the number of 
observations grow to infinity), the variance of the estimators 
are the same, with only vertical and horizontal stacking and 
averaging displaying different asymptotic behaviour All cases 
of stacking were shown to reduce the variance when compared 
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to averaging. An estimator for the spectrum in a more general 
model was also applied, and a simulation suggested that 
stacking was desirable there as well. 

In this contribution, we arrived at a concrete "sum of 
terms"-expressions for the variance in our model, and the 
proof for the lower variance in stacked observation models 
boiled down to more terms vanishing when L oo in this 
expression, when stacking is considered. Once formulas for 
the variance for estimators in more general models are found, 
this "vanishing of terms" may be proved to be a much more 
general phenomenon, making the concept of stacking even 
more useful. Future papers may contribute further along this 
line by putting the concept of stacking into a more general 
framework, applicable to more general models. In such a 
framework, Q (for how the moments of D are connected 
to those of the compound matrix D^^ i^) should be replaced 
with more general methods, similarly to how (|4|i was handled 
using results from [12 1. 

This paper only considers estimators which perform aver- 
aging or stacking of observations. Future work could consider 
non-linear ways of combining observations, and compare re- 
sults on these with the results obtained here. Theorem[T] should 
also have some significance when the noise is not Gaussian, 
since many random matrices with non-Gaussian, i.i.d. entries 
display the same asymptotic behaviour as Gaussian matrices. 
Future work could also consider this, and explore to which 
extent results generalize to the finite regime. 



Appendix A 
The proofs of Lemma[T]and Lemma|2] 



To ease the expressions in the following, we will set 

To prove that the estimators Dp^ in (|9]l are unbiased, 

we will first find alternative recursive expressions for them, 
and prove by induction that these are unbiased. Assume that 
we have found unbiased estimators Z3gi,...,g, building on (|7|, 

whenever qi + ■ ■ ■ + qi < pi + V Pk- Define Z?pi,...,p^ by 

reorganizing (|7]i to 

^Pi,...,Pk — J^pi,...,pk Z^p>i irespp jvTpTI 

xP^in,N)DiZZi^. (15) 

Here the term for the empty partial permutation has been sep- 
arated from the other terms, and, by convention, Di-^ = 1 
whenever n = Tr{pi,p2,q) with pi = P2 = Taking 



expectations on both sides in (flST l we get 

= ^{Ypi,...,pk) 

p>l xesPp 

= Dpi,...,Pk 

^|<T(7r)|— fc 

+ E E -^^PAn,N)D,,,..j^ 

p>l xespp 

^|(T(7r)|-fc 

-E E -^;^PAn,N)D,,,..j^ 

p>l TreSPp 

7r = 7r(pi ,P2 ,g) 

where we have again used (|7|l- This shows that Dp^^,,,^p^, also 
is unbiased. We will now show that this recursive definition 

of Dp^ pj. coincides with (|9]l, which will complete the proof 

of Lemma [T] 

Recursively replacing the Z?/i,. in dTsl l until there are 
only terms on the form ypi,...,pj. left, we arrive at an expression 
on the form 

(16) 

where tti , tt; are non-empty partial permutations, and where 
li, ...Jr are the cardinalities of the blocks after the identifica- 
tion of edges from all tti , . . . , tt; . We will call a tt = tti , . . . , tt; 
a nested partial permutation, since it corresponds to a nested 
application of partial permutations. The factor (—1)' comes 
from I applications of ( fTsl l. where each application contributes 
a —1 from therein. Due to this alternating sign, many terms 
in (fT6] l will cancel. The following class of permutations will 
be useful to see these cancellations: 

Definition 3: Let 11; ^ be the set of nested partial permuta- 
tions on the form {tti, tt/}, where jp^, | + • • • + {p.^ \ = k. 
Also, when tt = {tti, vr;} are nested partial permutations 
which do not contain any identifications involving edges i 
or j, let H-TT.ij C Ili_k+2 U n;-|_i_fc-|_2 be the set of nested 
partial permutations which equals tt, with the exception that 
the identification is added. 

It is clear that any tt e ^-kAJ gives equal contribution in 
( ITSb up to sign, since each such tt embraces the same edges, 
and the order of the identification of edges does not matter for 
the final graph. It is also clear that 

|n^,,,j nn/+i,fc+2| = ; + 

and that the contributions from the two sets 11^ ^ j nn;,fc-|_2 and 
njr.ij n n/_|.i,fe_|_2 have opposite signs, since the sign for any 
TT e n;_fc is (—1)'. Adding the contributions, we get that the 
total contribution from H^^.ij equals that from just one nested 
partial permutation in Ili+i,k+2 where we set 7r/+i — 
Summing over all tt and I where tt = {tti, vr;} does not 
contain any identifications involving i or j, we get that the 
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contribution from the set of tt which contain equals the 
sum over {tti, 7r;_i, tt; = In the same way we can 

sum over tt with replaced by all other edge possibilities, 
to arrive at the sum over all tt — {tti , tt/}, where all \pt^. \ = 
1, and where we need only sum over sets (i.e. the order of 
the elements does not matter). In other words, and since there 
are I = \pi \ partial permutations nested in this way, we can 
replace ( fTSl l with 



tv = tt(pi ,P2 ,g) 



This coincides with (|9]l, and the proof of Lemma [T| is finished. 

We will have use for the following lemma, which states an 
expression for the variance of Dp. We will only state it for the 
case of no stacking, and leave the variance for the estimators 
using stackings to Appendix IbI 

Lemma 2: Let SPR2p be the set of partial permutations of 
{!,..., 2p} such that all identifications are from to 
{p+ 1, 2p}, or vice versa. The variance 



E Dp Dr. 



of Dp equals 



E 

7reSPR2p 



k(^)l-2 



-P^{n,N)Di,,^„j^. 



(17) 



Proof: Inserting (|9|l twice we get 



[Dp''] - E 

E 



D, 



E 



, (1), , (2), r7k('^i)|-l „|o-(7r2)|-l 

'l(_l)lpl I"- 



xP^,{n,N)P^,{n,N) 



x(E 



Y{1) ,(1)K(2) ,(2) 
^1 "--Ttri ^1 ,---,tr2 



-E 



^i<i) dl' 



^;(2) j(2) 



where . . . , zfj-* are the cardinalities of the blocks of (j{'Ki) 

('2) ('2) 

divided by 2, li-f those of cr(7r2). Using © we can 

write 



E 



Y(l) ,(1)Y,(2) ,(2) 



E 



jCT(7r)|-ri-r2 



TVIpiI 



-P.(n,7V)Ai,....;(l9) 



2p-|pi '|-|pi '1 

^='^(pi >P2 ig) 



where li, . . . ,lr are the cardinalities of cr(7r) divided by 2, and 



E 



^(1) ,(1) 



E E 



p-lpi'l p-lpi'l 

rrCl)=7r(pii,pi2,g) '^^^^='^(P21>P22'l) 

„|<T(7r<i))t-ri ^|^(^(2))|-r2 

A^lpiil Af|p2i| 
xP„a)(7i,Ar)F,,2,(n,iV) 



(20) 



where /J^-*, . . . , Zr^^ are the cardinalities of o'(7r'^^)) divided by 



/(I) 



Z,-2 those of a{iT^'^^). The powers of n and N in 



P,(i)(n,iV)P^(2,(n,7V), 



2, 

( |20l i can be written 

jJ<T(7r<i))| + |CT(7r(2')|-ri-r2 
Ar|pll| + |P2l| 

which are seen to match the powers of n and N in (fT9] l when 
TT — TTi X does not contain any identification of edges from 
different expectations. These terms thus cancel, and we are 
left with summing over tt containing identification of edges 
between the two expectations. 

To see that we need only sum over tt containing only 
identification of edges from one expectation to another, note 
that a TTi containing cancels the contribution from a tt 
containing since the former has an additional power of 

—1. The same can be said for 712- The only terms not canceling 
therefore occur when tti and 7T2 are empty, and vr only contains 
identifications between the two expectations. These correspond 
to SPR2p by definition. All of them contribute with a positive 
sign, and all in all we get that Vp equals 

„k(7r)|-2 

-P^{n,N)Di,^„„i^ 



E 

7reSPR2 



iVlPi 



(since ri — r2 = 1), which is what we had to show. ■ 

Appendix B 
The proof of Theorem[T] 

The geometric interpretation of tt e SPR2p is as an identi- 
fication among some of 4p edges, where even edges are only 
identified with odd edges and vice versa, and where there are 
only identifications between {!,..., 2p} and {2p + l,...,4p}, 
and vice versa. It is clear that tt £ SPR2p is invariant under 
cyclic shifts of the form tt sifcS2i7r(sifeS2;)^^, where 

r + fc for r £ {1, 2p} 
r for r £ {2p + 1, 4p} 

r for r £ {1, 2p} 
r + 1 for r e {2p + 1, ...,4p} 



sikir) 
S2iir) 



(addition performed so that result stays within the same 
interval, either [1, 2p] or [2p+ 1, Ap]) as long as k and I 
either are both odd, or both even, in order for the identification 
to remain between even and odd elements and vice versa. 
The equivalence class of tt £ SPR2p under cyclic shifts is 
given by Dk.iSikS2i'^{sikS2i)^^ ^ where k and / either are both 
odd, or both even. We will denote by SPE2p the set of such 
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equivalence classes, and denote by tt G SPE2p the equivalence 

class of TT € SPR2p. 

From the geometric interpretation of tt it is clear that, when 
we instead of tt use sifcS2;7r(sn:S2/)^^, 

1) |pi| and IP2I is the same for tt and sikS2iTr{sikS2i)~^, 

2) |(T(7r)| = |o-(sifeS2;7r(sifeS2/)~^)|. The block cardinaH- 
ties Zi, Ir of (T(7r) and cr{sikS2iTT{sikS2i)~^) are also 
equal, 

3) when k and I are both even, fc, fed, I, Id are the same for 
p{-k) and p(sifeS2;7r(sifcS2/)~^), 

4) when k and ^ are both odd, 

fc(p(7r) = Kp(sifeS2;7r(sn;S2/)~^)) 

kd{p{TT) = ld[p{sikS2lTr{sikS2iy^)) 

1[p{-k) = fc(p(sifeS2;7r(sifcS2/)"^)) 
ld{p{iT) = kd{p{sikS2iT^{sikS2ir^))- 

By definition of P^r, the last two statements say that 

Ps^,s,Ms^.s,,)-^{n,N) = P^{n,N) 

when fc, Z are both even, and 

when k, I are both odd. Since there are equally many elements 
with k, I odd and k, I even under cyclic equivalence, we see 
that 

" (21) 



is a polynomial symmetric in n and N, where ~ denotes 
equivalence under cyclic shifts. The first statements above 
say that the rest of the powers of n and N in ( fTTb are 
unchanged under cyclic equivalence. By summing over the 
cyclic equivalence classes in ( [TT] ), we see that it can be 
rewritten to 



E 

TfGSPEap 



jk(7r)|-2 



)*(n,iV)Ai,...,/. 



(22) 



with symmetric in n and iV. Moreover, has the form 
Qirin, N) — an^N^ + hn^N^, where a + 6 is the number of 
elements in the cyclic equivalence class of tt. 

Since DpLi,L2 is -^^i"^ times the estimator for the p- 
th moment Fp of the compound matrix by the comments 
following the statement of Lemma [T] the variance Wp,.,L of 
Dp,Li,L2 in (fTTT l is, after replacing n with nLi, and with 
ivi2 in 



j|a(7r)|-2_^kW 



L 



2-2p 



TTSSPEsp 



E 

JreSPEa, 



\P1\ 



-Q^[nLi,NL2 



|<T(7r)|-2rkWt-2 

1 -Q^{nLi,NL2) 



\pi\ 



xL 



2p-|pi|-k(7r) 



E 

7reSPE2p 



|<T(7r)|-2 



-Q^inLi,NL2)Di,^ 



(23) 



where we have used ([3), and set L = LiL2- 

deg((57f) describes the number of vertices in the graph 
of random edges not bordering to deterministic edges. Each 
vertex is associated with a value < Lmax(7i, A^), so that 
has order at most L to the power of the number of vertices 
not bordering to deterministic edges. We will use this in the 
following, and consider the following possibilities: 

1) There are no deterministic edges: in this case, p = 
pil/2. Since there are only crossidentifications between 
{l,...,2p} and {2p + l,...,4p} for tt £ SPR2p, any 
vertex in {2p + 1, ...,4p} is identified with a vertex in 
{!,..., 2p}, so that {l,...,2p} contains representatives 
for all equivalence classes of vertices. There are thus 
at most p even equivalence classes, and at most p odd 
equivalence classes. Thus 



Q^{nLi,NL2) < 0{{nLi)P{NL2)P) 



0{LP) 



When p = 1, \pi\ = 2, and |pi|/2 = \pi\ — 1, so that 
QitinLi, NL2) < O (il^'il^i), and it is easy to check 
that we have equality for the only partial permutation 
in SPR2, and that Q^{nLi,NL2) = nNL\P^\-^ for 
this 7f. When p > 1, \pi\/2 < |jOi| — 1, so that 
Q^{nLi,NL2) = 0{L\p^\~^) for such tt. 

2) The graph of random edges is a tree, and there exist 
deterministic edges: since any crossidentification be- 
tween {!,..., 2p} and {2p + l,...,4p} does not give 
rise to a leaf node when all edges are considered, any 
leafnode in the tree of random edges must be bordering 
to a deterministic edge. Since the tree contains \pi \ + 1 
vertices, and since there are at least two leafnodes in 
any tree, we have that Q.fi-{nLi, NL2) has order at most 
O with equality only if the graph of random 
edges borders to exactly two deterministic edges. It is 
easily seen that this occurs if and only if \pi \ pairs of 
edges are identified in successive order. 

3) The graph of random edges is not a tree, and there 
exist deterministic edges: if there are two cycles in the 
graph of random edges, QTi{nLi, NL2) has order at 
most O (Ll''il^2j (two subtracted for the cycles, one for 
the deterministic edge). Similarly, if there is one cycle, 
and more than one vertex bordering to a deterministic 
edge, QTt{nLi,NL2) has order at most 0[L^p^^^^). 
Assume thus that there is only one vertex bordering to 
a deterministic edge, and only one cycle. It is easily 
checked that this vertex must be on the cycle, and that 
we must end up in the same situation as in 2) where 
edges are identified in successive order, for which we 
actually have a tree. Thus, there is nothing more to 
consider. 

We see that Qf^{nLi, NL2) has order at most O (L^p^^~^) in 
any case. Inserting into ( |23] ). the first case above contributes 
with L^^-^ for p = 1, for p > 1 we get only terms of 
order 0{L~'^). The third case contributes only with terms 
of order 0{L~^). For the second case, contributions are of 
order 0{L^'^) when \pi\ pairs of edges are not identified in 
successive order When they are identified in successive order. 
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we consider the following different possibilities: 

• When \pi\ is odd we will have fc(p(7r)) ~ kd{p{n)) — 
l{p{n)) - ld{p{Ti)) = so that 

Q^{nLi,NL2) 

= {L2N) 2 (Lm) 2 

so that the term for tt in ( |23] | is of order 

^(|pi|-l)/2-|pi| ^ ^-lpil/2-1/2^ -^ijgjj l^^l ^ 1^ jhis is 

O and the contribution in this case is -^^jjj- times 

the number of partitions in the equivalence class of tt 
When \pi\ > 1, all terms are of order 0{L^^). 

• When \pi\ is even, either 

1) fc(p(7r))-fcd(pW) = ^-1, Z(pW)-W(pW) = 

for which 

Q^{nLi,NL2) 

, |Pl I 1 , ^ IP1 I 

= {L2N) — -\Lin) — 

, . |pi 1 ipi ^ ipi 1 ^ 
^ N- — ^n—L- — ^Li, 

so that the term for tt in (1231 ) is of order 

When the stacking is not vertical, we have that 
Li < 0(ii/2), so that the term for tt is of 
order < 0{L-\p'^^^-^L^^^) = 0(i-l''il/2-i/2) < 
0(L"'^/^) When the stacking is vertical, the term is 
of order i^lPil/^, which is 0{L-^) when \pi\ > 2. 
When \pi\ ~ 2, the contribution in (l23T l is seen 
to be -^^^ times the number of partitions in the 
equivalence class of tt. 

2) k{p{TT)) - kd{p{TT)) = ^, l{p{TT)) - /d(p(7r)) = 

— 1, for which the term for tt in (|23] | similarly 
is shown to be of order 

^lpil/2-l-lpil^^ ^^-lpil/2-1^^^ 

and, similarly, only horizontal stacking with |/0i| = 
2 gives contributions of order 0{L^^). The contri- 
bution in (|23] | is seen to be times the number 
of partitions in the equivalence class of tt. 

When it comes to the number of elements in the corresponding 

equivalence classes, it is easy to see that 

• there are 2p^ elements for the class where \pi \ ~ 1, cor- 
responding to any choice of the 2p edges {1, 2p}, and 
any choice of the p even or odd edges in {2p+l, 4p}. 

• elements for each class where \pi \ —2. 
Summing up, we see that for p = 1, — L^^-^Di + 
^"^rOv '•yP^ stacking/averaging. For p > 2 we get 
that 

w = l~'%D2,-, + o(l-^/^) 

w - l-^^^D2,-, + l-^^D2,-2 + o(l-^/^) 

W = L-'^D2,-, + L-'^D2,-2 + 0{l-^/-'), 



and the first formulas in Theorem [T] follows after multiplying 
both sides with L, and taking limits. The case of averaging 
follows by noting that there are only positive coefficients in the 
formula ( |23] ) for the variance, and that the variance is divided 
by L when one takes L independent observations. 

Finally, we prove why the least variance is obtained when 
the compound observation matrix is as square as possible. 
With ci = nLi,C2 — NL2,c ~ we can write each 
QitinLi, NL2) as a scalar multiple of 

C1C2 + Cj^cJ 

= inN) — L— (ci= C2' +c,- c,^ ) 

^ k + l k + L / k-l l-k \ 

= {nN)~L~ ic~ + c~ j , 

It is clear that /(c) = c''°~'-'/^+c'-'^'^-'/^ has a global minimum 
at c = 1 on (0, 00), and the result follows. 
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